A baseline analysis of population and income was conducted. The histogram for population was skewed to the right. Census tracts had similar population counts, with a mean of about 4,000. County populations, by contrast, varied widely: some had a population of 1 million and others 10 million. Because their populations were similar, census tracts were easier to investigate than counties. The Q-Q plot confirmed the non-normality, as the points in the upper quartile fell far from the line.
The raw data for income were also strongly skewed to the right. The data appeared to follow a power-law curve: some individuals have amassed a very large income, and these outliers skew the distribution. The outliers and NA values were therefore removed, and on rechecking, the “cleaned data” appeared normal. The histogram was unimodal, and the points on the Q-Q plot did not stray from the line.
## [1] "26139.73"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 128 18776 24730 26140 32247 56040 3589
## [1] 10274.98
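The cleaning step described above can be sketched as follows. The report does not say which rule defined an outlier, so the 1.5 × IQR fence used here is an assumption, the income values are made up, and Python is used only for illustration (the output in this report comes from R).

```python
import statistics

def iqr_filter(values, k=1.5):
    """Drop missing values, then drop points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if low <= v <= high]

def skewness(values):
    """Moment-based skewness; a positive value means a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical incomes: twelve moderate values, two missing, two very large.
incomes = [15000, 18000, 20000, 22000, 23000, 24000, 25000, 26000,
           27000, 29000, 31000, 34000, None, None, 250000, 900000]
raw = [v for v in incomes if v is not None]
cleaned = iqr_filter(incomes)

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(cleaned), 2))
print("values kept:", len(cleaned), "of", len(raw))
```

A skewness close to zero after filtering is consistent with the unimodal histogram described above, although it does not by itself establish normality.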
Next, the seventeen independent variables were analyzed, beginning with the five freedom scores: economic freedom, personal freedom, regulatory policy, fiscal policy, and overall freedom. The box plots were split into four quartiles, with the income per capita shown for each quartile. For all five sets of box plots, there did not appear to be any differences between the quartiles, as they all overlapped across roughly the same range for their respective independent variables. The histograms did not appear normal: overall, the data were irregularly spread out, with large gaps between bins. The Q-Q plots told a similar story, as the points followed a sine-like pattern around the line, with heavy tails at either end. None of the freedom scores appeared to be normally distributed.
## Observations per group: 18417, 17756, 22780, 13244. 1064 missing.
## Factor w/ 4 levels "[-0.827,-0.256]",..: 3 3 3 3 3 3 3 3 3 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.8272 -0.2556 0.0152 -0.0866 0.1376 0.3550 1064
## Observations per group: 18061, 18704, 20888, 14544. 1064 missing.
## Factor w/ 4 levels "[-0.0444,0.0135]",..: 1 1 1 1 1 1 1 1 1 1 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.0444 0.0135 0.0803 0.0641 0.1064 0.2450 1064
## Observations per group: 19180, 17739, 17836, 17442. 1064 missing.
## Factor w/ 4 levels "[-0.457,-0.223]",..: 3 3 3 3 3 3 3 3 3 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.4569 -0.2228 -0.0737 -0.1511 -0.0322 0.0715 1064
## Observations per group: 18164, 19958, 16580, 17495. 1064 missing.
## Factor w/ 4 levels "[-0.37,-0.0602]",..: 3 3 3 3 3 3 3 3 3 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.3702 -0.0602 0.0634 0.0646 0.1767 0.4024 1064
## Observations per group: 18440, 17906, 18395, 17456. 1064 missing.
## Factor w/ 4 levels "[-0.814,-0.102]",..: 2 2 2 2 2 2 2 2 2 2 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.8136 -0.1021 0.0652 -0.0224 0.1637 0.4614 1064
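The quartile grouping behind these box plots can be sketched as below. This is a minimal illustration, not the report's code: it assumes the groups are cut at the quartiles of the independent variable, with right-closed intervals like the factor levels printed above, and the data are hypothetical.

```python
import bisect
import statistics
from collections import defaultdict

def quartile_groups(xs, ys):
    """Group ys by the quartile of the paired x; pairs with a missing value are skipped."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    cuts = statistics.quantiles([x for x, _ in pairs], n=4, method="inclusive")
    groups = defaultdict(list)
    for x, y in pairs:
        # bisect_left puts a value equal to a cut point into the lower group,
        # which corresponds to right-closed intervals such as (a, b].
        groups[bisect.bisect_left(cuts, x) + 1].append(y)
    return cuts, dict(groups)

# Hypothetical scores (x) and incomes (y); the last pair has a missing score.
scores = [1, 2, 3, 4, 5, 6, 7, 8, None]
income = [10, 20, 30, 40, 50, 60, 70, 80, 90]
cuts, groups = quartile_groups(scores, income)

for q in sorted(groups):
    print("quartile", q, "n =", len(groups[q]), "median income =", statistics.median(groups[q]))
```

Comparing the per-quartile medians (or full box plots) then shows whether income shifts as the independent variable increases.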
Next, the seven work variations (professional, production, unemployment, office, service, construction, and self-employed) were assessed for normality. The box plots for unemployment, service, construction, and production showed income decreasing as the share of that work variation in the census tract increased. That is to say, as more unemployed individuals were accounted for in a given census tract, income per capita decreased. The only work variation that showed an increase in average income was professional work. Office and self-employed remained relatively stable across quartiles. In the histograms, only the proportion of professionals appeared normally distributed; the remaining six work variations were all skewed to the right. For professionals, the Q-Q plot affirmed normality, as the points did not stray far from the line and the right and left tails were very small. The same cannot be said for the other variables, each of which had an oversized right tail and a relatively small left tail. Overall, the proportion of professionals appeared normally distributed while the other work variations did not.
## Observations per group: 18765, 18330, 18095, 17970. 101 missing.
## Factor w/ 4 levels "[0,5.1]","(5.1,7.7]",..: 2 4 2 3 1 3 3 3 3 2 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.100 7.700 9.028 11.400 100.000 101
## Observations per group: 18379, 18323, 18173, 18281. 105 missing.
## Factor w/ 4 levels "[0,24.1]","(24.1,32.6]",..: 3 1 2 2 4 2 1 3 2 2 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 24.1 32.6 34.8 43.8 100.0 105
## Observations per group: 18303, 18828, 17766, 18259. 105 missing.
## Factor w/ 4 levels "[0,20.1]","(20.1,23.8]",..: 2 2 2 3 1 4 3 4 3 1 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 20.10 23.80 23.95 27.50 100.00 105
## Observations per group: 18672, 18068, 18275, 18141. 105 missing.
## Factor w/ 4 levels "[0,13.5]","(13.5,17.9]",..: 2 4 4 3 2 2 4 1 2 2 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 13.5 17.9 19.1 23.6 100.0 105
## Observations per group: 18588, 18209, 18113, 18246. 105 missing.
## Factor w/ 4 levels "[0,5]","(5,8.4]",..: 3 3 3 3 1 2 3 2 2 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.000 8.400 9.295 12.500 100.000 105
## Observations per group: 18566, 18226, 18085, 18279. 105 missing.
## Factor w/ 4 levels "[0,7.1]","(7.1,11.8]",..: 3 4 3 3 3 3 3 1 3 4 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.10 11.80 12.86 17.40 100.00 105
## Observations per group: 19155, 17839, 18195, 17967. 105 missing.
## Factor w/ 4 levels "[0,3.6]","(3.6,5.5]",..: 2 3 4 1 2 3 1 3 2 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.600 5.500 6.227 8.100 100.000 105
## Individual EDA of ethnicities

Finally, the five ethnic variables (Native, White, Black, Hispanic, and Asian) were investigated. The box plots for White showed an increase in average income across the first, second, and third quartiles but no change in the fourth. The box plots for Asian showed an increase from the first through the fourth quartile. The box plots for Hispanic increased slightly between the first and second quartiles and did not change for the third quartile, while the fourth quartile decreased significantly. The box plots for Black showed an increase in average income between the first and second quartiles, followed by a decrease from the second to the fourth quartile. Overall, it appeared that average income did change with the concentration of ethnicities in a census tract. The histogram for White was bimodal, with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White had the most responses, followed by Hispanic, Black, Asian, and Native. For each of the ethnicity variables, the points on the Q-Q plot followed a curve with large left and right tails. There were also not enough responses from the Native ethnicity to construct a meaningful box plot. For the Native Q-Q plot, there was a clear pattern in the points along the line, implying non-normality. Therefore, based on the box plots, histograms, and Q-Q plots, none of the ethnicities appear normally distributed.
## Observations per group: 18430, 18252, 18303, 18276. 0 missing.
## Factor w/ 4 levels "[0,0.7]","(0.7,3.7]",..: 3 4 4 2 4 3 4 3 3 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.70 3.70 13.27 14.40 100.00
## Observations per group: 18472, 18246, 18233, 18310. 0 missing.
## Factor w/ 4 levels "[0,2.4]","(2.4,7]",..: 1 1 1 3 1 3 2 1 1 1 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.40 7.00 16.86 20.40 100.00
## Observations per group: 20124, 17253, 17651, 18233. 0 missing.
## Factor w/ 4 levels "[0,0.2]","(0.2,1.4]",..: 2 3 2 1 3 1 1 1 1 2 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.20 1.40 4.59 4.80 91.30
## Observations per group: 18351, 18331, 18285, 18294. 0 missing.
## Factor w/ 4 levels "[0,39.4]","(39.4,71.4]",..: 3 2 3 3 2 3 3 3 4 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 39.40 71.40 62.03 88.30 100.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7279 0.4000 100.0000
## corrplot 0.84 loaded
## Call:
## aov(formula = IncomePerCap ~ EthnicPlurality, data = anova_dat)
##
## Terms:
## EthnicPlurality Residuals
## Sum of Squares 1.632113e+12 5.681161e+12
## Deg. of Freedom 4 69566
##
## Residual standard error: 9036.911
## Estimated effects may be unbalanced
## 3589 observations deleted due to missingness
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "na.action" "contrasts" "xlevels" "call"
## [13] "terms" "model"
## Df Sum Sq Mean Sq F value Pr(>F)
## EthnicPlurality 4 1.632e+12 4.080e+11 4996 <2e-16 ***
## Residuals 69566 5.681e+12 8.167e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 3589 observations deleted due to missingness
## Call:
## aov(formula = IncomePerCap ~ WorkPlurality, data = anova_dat)
##
## Terms:
## WorkPlurality Residuals
## Sum of Squares 2.642140e+12 4.671134e+12
## Deg. of Freedom 6 69564
##
## Residual standard error: 8194.432
## Estimated effects may be unbalanced
## 3589 observations deleted due to missingness
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "na.action" "contrasts" "xlevels" "call"
## [13] "terms" "model"
## Df Sum Sq Mean Sq F value Pr(>F)
## WorkPlurality 6 2.642e+12 4.404e+11 6558 <2e-16 ***
## Residuals 69564 4.671e+12 6.715e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 3589 observations deleted due to missingness
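The F values in the two ANOVA tables follow directly from the printed sums of squares and degrees of freedom. The sketch below redoes that arithmetic and also reports eta squared (the share of total variation explained by the grouping), which is not part of the original output and is added here only as an illustrative effect-size measure.

```python
import math

def anova_summary(ss_between, df_between, ss_within, df_within):
    """Return the F value, residual standard error, and eta squared for a one-way ANOVA."""
    ms_between = ss_between / df_between   # mean square of the factor
    ms_within = ss_within / df_within      # mean square of the residuals
    f_value = ms_between / ms_within
    residual_se = math.sqrt(ms_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, residual_se, eta_squared

# Sums of squares and degrees of freedom copied from the output above.
ethnic_f, ethnic_se, ethnic_eta2 = anova_summary(1.632113e12, 4, 5.681161e12, 69566)
work_f, work_se, work_eta2 = anova_summary(2.642140e12, 6, 4.671134e12, 69564)

print("EthnicPlurality: F =", round(ethnic_f), " eta^2 =", round(ethnic_eta2, 3))
print("WorkPlurality:   F =", round(work_f), " eta^2 =", round(work_eta2, 3))
```

With roughly 70,000 observations, even small group differences yield very small p-values, so a measure such as eta squared is a useful complement when comparing how much of the income variation each grouping accounts for.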
| Work | Asian | Black | Hispanic | Native | White | Total |
|---|---|---|---|---|---|---|
| Construction | 0 (18) | 31 (104) | 639 (137) | 1 (3) | 358 (767) | 1029 (1029) |
| Error | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| Office | 128 (192) | 1562 (1087) | 2197 (1424) | 17 (30) | 6811 (7982) | 10715 (10715) |
| Production | 19 (70) | 480 (398) | 993 (521) | 6 (11) | 2422 (2920) | 3920 (3920) |
| Professional | 927 (851) | 2345 (4803) | 2506 (6294) | 114 (131) | 41474 (35287) | 47366 (47366) |
| SelfEmployed | 0 (0) | 0 (1) | 1 (2) | 0 (0) | 11 (9) | 12 (12) |
| Service | 240 (174) | 2744 (983) | 3273 (1288) | 49 (27) | 3388 (7222) | 9694 (9694) |
| Unemployment | 0 (8) | 257 (43) | 113 (56) | 15 (1) | 39 (316) | 424 (424) |
| Total | 1314 (1314) | 7419 (7419) | 9722 (9722) | 202 (202) | 54503 (54503) | 73160 (73160) |

Rows are work categories and columns are ethnicities. Each cell shows the observed value followed by the expected value in parentheses.

χ2=NaN · df=28 · Cramer’s V=NaN · Fisher’s p=0.000
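The NaN reported for χ2 and Cramer’s V is most likely caused by the empty Error row: its expected counts are all zero, so each of its cells contributes 0/0 to the statistic, and df=28 = (8 − 1) × (5 − 1) indicates that the row was counted. The sketch below is one way to recompute the statistic after dropping empty rows and columns. It is an illustration in Python using the observed counts from the table, not the procedure used to produce the report.

```python
import math

def chi_square(table):
    """Pearson chi-square, degrees of freedom, and Cramer's V for a list of rows.

    Rows and columns with a zero total are dropped first, because their
    expected counts are zero and would make the statistic undefined.
    """
    rows = [r for r in table if sum(r) > 0]
    keep = [j for j in range(len(rows[0])) if sum(r[j] for r in rows) > 0]
    rows = [[r[j] for j in keep] for r in rows]
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(col) for col in zip(*rows)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, r in enumerate(rows):
        for j, observed in enumerate(r):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_tot) - 1)
    v = math.sqrt(chi2 / (n * min(len(rows) - 1, len(col_tot) - 1)))
    return chi2, df, v

# Observed counts; columns are Asian, Black, Hispanic, Native, White.
observed = [
    [0, 31, 639, 1, 358],           # Construction
    [0, 0, 0, 0, 0],                # Error (empty level)
    [128, 1562, 2197, 17, 6811],    # Office
    [19, 480, 993, 6, 2422],        # Production
    [927, 2345, 2506, 114, 41474],  # Professional
    [0, 0, 1, 0, 11],               # SelfEmployed
    [240, 2744, 3273, 49, 3388],    # Service
    [0, 257, 113, 15, 39],          # Unemployment
]
chi2, df, v = chi_square(observed)
print("chi-square:", round(chi2, 1), "df:", df, "Cramer's V:", round(v, 3))
```

Several remaining cells (for example the SelfEmployed row) still have very small expected counts, so the chi-square approximation should be read with caution even after the empty row is removed.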
Seventeen variables from the dataset were chosen for the regression. Using measures such as R2, adjusted R2, Mallows' Cp (CP), the Bayesian Information Criterion (BIC), and the residual sum of squares (RSS), the best subset of variables for the model was selected. To begin, each measure was plotted against the number of variables.

## Exhaustive Search

From the distribution graph, almost all the variables are included, which was not helpful. If we were to select the best variables, they would be ‘WHITE’, ‘Native’, ‘Asian’, ‘Professional’, ‘Office’, ‘Construction’, ‘Production’, ‘Unemployment’, ‘Fiscal Policy’, ‘Personal Freedom’, and ‘service’, which give the highest R2 value of .68. The highest adjusted R2 is also .68; the only difference is that the variable ‘Service’ is excluded. For BIC and CP, the lowest values are 12 and 10, obtained when 12 and 8 variables, respectively, are included in the regression.
## Reordering variables and trying again:
## [1] "np" "nrbar" "d" "rbar" "thetab" "first"
## [7] "last" "vorder" "tol" "rss" "bound" "nvmax"
## [13] "ress" "ir" "nbest" "lopt" "il" "ier"
## [19] "xnames" "method" "force.in" "force.out" "sserr" "intercept"
## [25] "lindep" "reorder" "nullrss" "nn" "call"
## [1] 14
## [1] 13
## [1] 12
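To make the selection criteria concrete, the sketch below computes adjusted R2, CP, and BIC from a model's RSS using textbook formulas. CP is taken here to be Mallows' Cp, the statistic that best-subset tools usually report under that name; that reading, the function names, and the numbers are all assumptions for illustration and are not values from this analysis.

```python
import math

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors plus an intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def mallows_cp(rss, sigma2_full, n, p):
    """Mallows' Cp; sigma2_full is the residual variance of the largest model."""
    return rss / sigma2_full - n + 2 * (p + 1)

def bic(rss, n, p):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + (p+1)*ln(n)."""
    return n * math.log(rss / n) + (p + 1) * math.log(n)

# Hypothetical best models of each size: number of predictors -> RSS.
n, tss = 1000, 1000.0
rss_by_size = {4: 330.0, 8: 322.0, 12: 321.5}
sigma2_full = rss_by_size[12] / (n - 12 - 1)

r2 = {p: 1 - rss / tss for p, rss in rss_by_size.items()}
adj = {p: adjusted_r2(r2[p], n, p) for p in rss_by_size}
cp = {p: mallows_cp(rss, sigma2_full, n, p) for p, rss in rss_by_size.items()}
bics = {p: bic(rss, n, p) for p, rss in rss_by_size.items()}

best_r2 = max(r2, key=r2.get)
best_adj = max(adj, key=adj.get)
best_cp = min(cp, key=cp.get)
best_bic = min(bics, key=bics.get)
print("R2 prefers", best_r2, "| adjusted R2", best_adj, "| Cp", best_cp, "| BIC", best_bic)
```

In this made-up example, plain R2 favors the largest model while the penalized criteria favor smaller ones, with BIC penalizing extra variables most heavily. That is the usual reason the criteria disagree about the best model size.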
R2 is not used as a criterion, since it usually improves as the number of variables grows and so leads to overfitting. For adjusted R2, the best number of variables could be chosen from the range 4 to 10. The best variables are ‘WHITE’, ‘Professional’, ‘Unemployment’, ‘Personal Freedom’, and ‘service’, at the highest R2 value of .68. For BIC and CP, the lowest values are at 12 and 10 variables, respectively.
## Reordering variables and trying again:
Now backwards (nvmax=17 and nbest=2). The best variables could be chosen from the range of 9 to 11, and they are ‘WHITE’, ‘Native’, ‘Asian’, ‘Professional’, ‘Office’, ‘Construction’, ‘Production’, ‘Unemployment’, ‘Fiscal Policy’, ‘Personal Freedom’, and ‘service’. For BIC and CP, the lowest values are at 12 and 10 variables, respectively.
## Reordering variables and trying again:
Lastly, sequential replacement. How accurate and precise are these models? We will not know until we run a validation-set approach or cross-validation.
## Reordering variables and trying again:
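As a rough illustration of the cross-validation step mentioned above, the sketch below estimates out-of-sample error with k-fold cross-validation. To stay self-contained it fits a one-predictor least-squares line on synthetic data. The variable names, coefficients, and noise level are hypothetical, and the actual models in this report have many predictors.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def k_fold_mse(xs, ys, k=5, seed=0):
    """Mean squared prediction error over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score the fitted line only on observations it has not seen.
        errors.extend((ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out)
    return statistics.fmean(errors)

# Synthetic data: income rises with a hypothetical "professional share".
rng = random.Random(1)
share = [rng.uniform(0, 80) for _ in range(200)]
income = [15000 + 400 * s + rng.gauss(0, 3000) for s in share]

cv_mse = k_fold_mse(share, income)
print("cross-validated RMSE:", round(cv_mse ** 0.5))
```

Running the same procedure for each candidate subset and comparing the cross-validated errors would indicate which model size generalizes best, rather than relying on in-sample measures alone.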